Abstract. Random forests are a type of ensemble method that makes predictions by
combining the results of several independent trees. However, the theory of random
forests has long been outpaced by their application. In this paper, we propose a novel
random forests algorithm based on cooperative game theory. The Banzhaf power index is
employed to evaluate the power of each feature by traversing possible feature
coalitions. Unlike the previously used information gain rate of information theory,
which simply chooses the most informative feature, the Banzhaf power index can be
considered a metric of the importance of each feature with respect to the dependency
among a group of features. More importantly, we prove the consistency of the proposed
algorithm, named Banzhaf random forests (BRF). This theoretical analysis takes a step
towards narrowing the gap between the theory and practice of random forests for
classification problems. Experiments on several UCI benchmark data sets show that BRF
is competitive with state-of-the-art classifiers and dramatically outperforms previous
consistent random forests. In particular, it is much more efficient than previous
consistent random forests.

Keywords: random forests, Banzhaf power index, cooperative game, classification


1 Introduction 

Ensemble methods are learning algorithms that construct a set of classifiers and
combine them to classify new unseen data [1]. Random forests are a type of ensemble
method based on the combination of several independent decision trees [2]. In recent
years, the random forests framework and its variants have been successfully applied in
practice as a general classification and regression tool. In particular, random forests
have been widely used in computer vision [3, 4, 5, 6] and pattern recognition
applications [7, 8, 9, 10], where they have advanced the state of the art in
performance. Despite these successful applications, the theoretical analysis of random
forest models remains very difficult; even their basic mathematical properties are hard
to understand. In [11] and [12], Biau and colleagues try to narrow the gap between the
theory and practice of random forests. However, the models proposed in these two papers
do not deliver effective results and are not efficient to run.

In this paper, we introduce a novel random forests algorithm based on cooperative game
theory. We adopt the Banzhaf power index to evaluate the power of each feature by
traversing all possible coalitions, and we therefore call the proposed algorithm
Banzhaf random forests (BRF). Unlike the previously used information gain rate of
information theory, which simply chooses the most informative feature, the Banzhaf
power index measures the importance of each feature with respect to the dependency
among a group of features (a coalition). More importantly, we prove the consistency of
the forest, which contributes to narrowing the gap between theory and practice for
random classification forests.

The rest of this paper is organized as follows. In Section 2, we provide a brief
overview of existing random forests models and analyze their advantages and
disadvantages. In Section 3, we introduce the general random forests framework,
including the construction of trees and randomness injection. Section 4 describes the
proposed algorithm, Banzhaf random forests (BRF), in detail, while Section 5 is devoted
to the justification of the consistency of BRF. Section 6 shows the experimental
results on some UCI benchmark data sets and Section 7 concludes this paper.


2 Related work 

Classic random forests, introduced by Breiman [2], combine several decision trees [13]
with bagging [14]. The main idea of random forests builds on the early work of [15] on
the random subspace method, the feature selection work of [16], and the random split
selection approach of [17]. Building on the seminal work of Breiman [2], [18] suggests
that it is best to average across sets of trees with different structures rather than
over any of the constituent trees. Criminisi et al. [19] present a unified, efficient
model of random decision forests which can be applied to a number of machine learning,
computer vision and medical image analysis tasks. With the development of random
forests in recent years, they have been applied to a wide variety of real-world
problems [?].

Despite the successful applications of random forests in practice, the mathematical
properties behind them are not yet well understood. For example, the early theoretical
work of [24], which is essentially based on mathematical heuristics, has not been
formalized into a rigorous theory.

In theory, two main properties of random forests are of interest. One is the
consistency of the model, i.e., whether it converges to an optimal solution as the data
set grows infinitely large. The other is the rate of convergence. Our paper mainly
focuses on consistency, which, as [11] has proved, Breiman's random forests cannot
guarantee.

Many researchers have worked on designing consistent random forests. Meinshausen [25]
has shown that a random forests algorithm for quantile regression is consistent;
Ishwaran and Kogalur [26] have shown the consistency of their survival forests model;
Denil et al. [27] show the consistency of an online version of random forests, while
[28] presents a new random regression forests model. These consistent models can be
applied to regression, survival or online settings, but not to batch classification
settings where all the training data can be used together for learning. In this paper,
we propose a novel random forests model based on cooperative game theory for
multi-class classification problems. The consistency of the proposed algorithm is also
proved.

Two papers more closely related to our work are [11] and [12]. [11] proves the
consistency of some popular averaging classifiers, including random forests.
Specifically, the authors view [2] as a weighted layered nearest neighbor classifier
from the perspective of the taxonomy proposed by [29]. Unfortunately, this property
prevents the consistency of random tree classifiers. To remedy the inconsistency of
tree classifiers, the authors suggest the technique introduced in [30]. Moreover, [11]
also proposes a scale-invariant version of random forests with consistency. Recently,
[12] presents a new model of random forests, which is similar to the original algorithm
of [2]. The main difference between these two models is in how random features are
selected. [12] requires a second independent data set to evaluate the importance index
of each feature and uses this property to prove the consistency of the algorithm, while
the model of [2] does not need the second data set. In this paper, we use the Banzhaf
power index to evaluate the power of each feature by traversing all possible feature
coalitions, without employing a second data set. The consistency of the proposed
algorithm is theoretically guaranteed.


3 Random Forests 


In this section we briefly review the random forests framework. Typically, random
forests are built by combining the predictions of several trees, each of which is
trained in isolation. Unlike in boosting [31], where the base models are trained and
combined using a dynamic weighting scheme, the trees are trained independently and
their predictions are combined through averaging or majority voting. For a more
comprehensive review, please refer to [2] and [?].

To construct a random tree, three core steps are required: the first is the method for 
splitting the tree nodes; the second is the type of predictor to use in each leaf, and the 
third is the method of injecting randomness into the trees. 

In a typical method for splitting nodes, samples are sent to the left or right child
depending on whether or not they exceed a threshold value in a chosen feature.
Alternatively, for linear splits, a linear combination of features is compared with a
threshold to make the decision. The threshold value in either case can be chosen
randomly or by optimizing a function of the data; for example, the Gini index and the
information gain rate are commonly used. In this paper, we choose the midpoint of a
feature as the splitting threshold, which makes the proposed algorithm very efficient,
especially in large-scale applications.
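
As a minimal sketch of the midpoint rule described above, the following Python snippet
splits samples on one feature. It assumes that the "midpoint of a feature" means halfway
between the smallest and largest value observed at the node; the paper does not spell
this out, and the function name `midpoint_split` is hypothetical.

```python
from typing import List, Sequence, Tuple


def midpoint_split(
    rows: Sequence[Sequence[float]], feature: int
) -> Tuple[float, List[int], List[int]]:
    """Split row indices on one feature at the midpoint of its observed range.

    Assumption: "midpoint" is read as (min + max) / 2 of the feature values
    at the node. This is an illustration, not the paper's stated definition.
    """
    values = [row[feature] for row in rows]
    threshold = (min(values) + max(values)) / 2.0
    left = [i for i, v in enumerate(values) if v <= threshold]
    right = [i for i, v in enumerate(values) if v > threshold]
    return threshold, left, right


rows = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0], [9.0, 40.0]]
threshold, left, right = midpoint_split(rows, feature=0)
print(threshold, left, right)  # 5.0 [0, 1, 2] [3]
```

Under this reading the threshold needs only one pass over the feature values, with no
search over candidate cut points.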

In order to split a node of each tree, candidate features are generated and a criterion
is evaluated to choose between them. A simple strategy, as in the models analyzed in
[?], is to choose among the features uniformly at random. A more common approach is to
choose the candidate split which optimizes a purity function over the nodes that would
be created. In particular, two typical choices are to maximize the information gain and
to minimize the Gini index. In our Banzhaf random forests, we use the Banzhaf power
index of cooperative game theory [33], which measures the distribution of power among
the features on the data sets.

For the choice of predictors, [?] propose several different leaf predictors for
regression and other tasks. One common choice is to average over the training points
which fall in that leaf. Another is to take a majority vote over the points in that
leaf. In our work, we take the latter strategy.

It is important to inject randomness into the trees of a random forest. This can be
achieved in several ways. One choice is in the features to be split at each node;
another is in the coefficients for random combinations of features. One common method
is to build each tree using a bootstrapped or sub-sampled data set. In this way, each
tree in the forest is trained on slightly different data, which introduces differences
between the trees. Similar to [2], our work uses a bootstrap method to inject
randomness into each tree.


4 Banzhaf Random Forests 

In this section, we describe the proposed algorithm, Banzhaf random forests (BRF), in
detail. Firstly, we introduce some basic concepts of cooperative game theory. Secondly,
based on the Banzhaf power index, we introduce the way to construct the randomized
trees. Thirdly, we combine the Banzhaf trees to form the Banzhaf random forests.
Finally, we present the prediction method of the Banzhaf random forests.


4.1 Basic concepts of cooperative game theory 

Cooperative game theory mainly studies 'acceptable' ways of distributing the gains
collectively achieved by a group of cooperating agents [34]. A cooperative profit game
$\Gamma = (\mathcal{N}, \gamma)$ consists of a player set $\mathcal{N} = \{1, 2, \ldots, n\}$
and a characteristic function $\gamma : 2^{\mathcal{N}} \to \mathbb{R}$. For each subset
$S \subseteq \mathcal{N}$, $\gamma(S)$ can be interpreted as the profit achieved by the
players in $S$. The usual goal in a cooperative game is to distribute the total gain
$\gamma(\mathcal{N})$ of the global coalition $\mathcal{N}$ among the players in a fair
and reasonable way. Different requirements on fairness and rationality lead to
different solution concepts of the cooperative game, such as the core, the Banzhaf
power index and some related concepts of approximate core. Among these solution
concepts, the Banzhaf power index is the one motivated by fairness.

For a game $\Gamma = (\mathcal{N}, \gamma)$, if it is monotone, i.e., it satisfies
$\gamma(C) \le \gamma(D)$ for every pair of coalitions $C, D \subseteq \mathcal{N}$ such
that $C \subseteq D$, and its characteristic function only takes the values 0 and 1,
i.e., $\gamma(S) \in \{0, 1\}$ for all $S \subseteq \mathcal{N}$, the game is called a
simple game. In a simple game, the coalitions with value 1 are called 'winning' and
those with value 0 are called 'losing', i.e., $\gamma(S) = 1$ and $\gamma(S) = 0$,
respectively. Each coalition $S \cup \{i\}$ that wins when $S$ loses is called a swing
for player $i \in \mathcal{N}$, because the membership of player $i$ in the coalition is
crucial to its winning. In effect, the Banzhaf power index counts, for each player
$i \in \mathcal{N}$, the losing coalitions that become winning when $i$ joins them, in
order to find the most crucial player, namely the one that turns the largest number of
coalitions into winning ones.

The Banzhaf power index, which yields a unique outcome in coalitional games, was
proposed to measure the marginal contribution of players in the game [35]. In simple
games, the Banzhaf power index has a particularly attractive interpretation: it
measures the power of a player, i.e., the probability that the player can influence the
outcome of the game. In this paper, we use the Banzhaf power index to measure the power
of each feature.
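
To make the notion of a swing concrete, here is a small Python sketch that counts
swings by brute-force enumeration in a simple game. The weighted voting game used as
input (weights 2, 1, 1 and quota 3) is an illustrative assumption that does not come
from the paper, and the swing count is divided by $2^{n-1}$, the usual normalization of
the Banzhaf index.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Hashable, Sequence

Coalition = FrozenSet[Hashable]


def banzhaf_indices(
    players: Sequence[Hashable], gamma: Callable[[Coalition], int]
) -> Dict[Hashable, float]:
    """Brute-force Banzhaf power index of every player in a simple game.

    gamma maps a coalition to 1 (winning) or 0 (losing). A swing for player
    i is a coalition S without i such that S loses and S + {i} wins.
    """
    n = len(players)
    indices = {}
    for i in players:
        others = [p for p in players if p != i]
        swings = 0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of i: 1 for a swing, 0 otherwise.
                swings += gamma(s | {i}) - gamma(s)
        indices[i] = swings / 2 ** (n - 1)
    return indices


# Hypothetical weighted voting game: a coalition wins if its weight >= quota.
weights = {"a": 2, "b": 1, "c": 1}
quota = 3


def gamma(coalition: Coalition) -> int:
    return int(sum(weights[p] for p in coalition) >= quota)


power = banzhaf_indices(list(weights), gamma)
print(power)  # {'a': 0.75, 'b': 0.25, 'c': 0.25}
```

The enumeration visits $2^{n-1}$ coalitions per player, so it is only practical for
small player sets.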


4.2 Construction of Banzhaf tree 

Figure 1 shows the structure of a Banzhaf decision tree. For the root node, the feature
is selected with the information gain rate. For all the other nodes, the features are
selected with the Banzhaf power index. The idea of the Banzhaf decision tree is mainly
motivated by game theory, in particular cooperative game theory. We take the features
of the data as the players in a game, so the original tree construction problem is
transformed into a cooperative 'feature' game. At each node, features are evaluated in
the form of coalitions, and the best one is selected for splitting.



Next, we present the way the Banzhaf power index is computed in this work.

The original definition of the Banzhaf power index is described in [35]. Given a
cooperative game $\Gamma = (\mathcal{N}, \gamma)$ with $|\mathcal{N}| = n$, the Banzhaf
power index of a player $i \in \mathcal{N}$ is the probability of swings for player $i$.
We denote the Banzhaf power index by $\beta_i(\Gamma)$; it is given by

$$\beta_i(\Gamma) = \frac{1}{2^{n-1}} \sum_{S \subseteq \mathcal{N} \setminus \{i\}} \Delta_i(S), \qquad (1)$$

where $\Delta_i(S)$ is the marginal contribution of player $i$, i.e.
$\Delta_i(S) = \gamma(S \cup \{i\}) - \gamma(S)$.

The Banzhaf power index measures the distribution of power among the players in
cooperative games. Here, we apply it to decision tree construction, attempting to
estimate the power of each feature at each tree node. The power of each feature can be
measured by averaging the contributions that it makes to each of the subsets it belongs
to. Let the coalition $\mathcal{K}$ be a candidate feature subset and let $f_i$
($f_i \notin \mathcal{K}$) be the feature to be estimated. Define the ratio
$\rho = \mu_i(\mathcal{K}) / |\mathcal{K}|$ to represent the impact of feature $f_i$ on
coalition $\mathcal{K}$, where $\mu_i(\mathcal{K})$ can be interpreted as the number of
features in $\mathcal{K}$ that are interdependent with the feature $f_i$, and
$|\mathcal{K}|$ is the number of features in the coalition $\mathcal{K}$. We then define
a threshold value $\tau$. If $\rho < \tau$ (commonly $\tau = 1/2$), we call the
coalition $\mathcal{K} \cup f_i$ 'losing', otherwise 'winning', i.e.

$$\Delta_i(\mathcal{K} \cup f_i) = \begin{cases} 1 \ (\text{winning}), & \rho \ge \tau; \\ 0 \ (\text{losing}), & \rho < \tau. \end{cases} \qquad (2)$$

Here, $\Delta_i(\mathcal{K} \cup f_i) = 1$ means that feature $f_i$ is the key to making
the coalition exhibit better performance. The threshold value 1/2 means that, if more
than half of the features are interdependent with $f_i$, it joins the coalition and
makes it 'winning'. Hence, for simplicity of computation, we define $\Delta_i(S)$ in
Eq. (1) as

$$\Delta_i(S) = \begin{cases} 1, & \rho \ge \tau; \\ 0, & \rho < \tau. \end{cases} \qquad (3)$$

For clarity, we give an example showing how to compute the Banzhaf power index. Given a
cooperative 'feature' game $\Gamma = (\mathcal{N}, \gamma)$, let the feature player set
be $\mathcal{N} = \{f_1, f_2, f_3, f_4\}$. Suppose that the current goal is to calculate
the Banzhaf power index of $f_4$. The total number of possible coalitions
$S \subseteq \mathcal{N} \setminus f_4$ is 7 (excluding $\emptyset$). Assume the winning
coalitions with respect to $f_4$ are $\{f_2\}$, $\{f_2, f_3\}$ and $\{f_1, f_2\}$, i.e.
about half of the coalitions are interdependent with feature $f_4$. Then the Banzhaf
power index of $f_4$ can be computed as

$$\beta_4(\Gamma) = \frac{1}{2^{3}} \sum_{S \subseteq \mathcal{N} \setminus f_4} \Delta_4(S) = \frac{3}{2^{3}}. \qquad (4)$$

The Banzhaf power index of the other features can be computed in the same way. In
general, the Banzhaf power index is rarely zero in large-scale, high-dimensional
applications.
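
The example above can be checked mechanically. The sketch below enumerates the seven
non-empty coalitions of $\{f_1, f_2, f_3\}$, counts those listed as winning with respect
to $f_4$, and then divides by $2^{n-1} = 8$; that divisor is the standard Banzhaf
normalization and is assumed here. The variable names are illustrative.

```python
from itertools import combinations

features = ["f1", "f2", "f3", "f4"]
target = "f4"

# Coalitions assumed to be 'winning' with respect to f4, as in the example.
winning = {
    frozenset({"f2"}),
    frozenset({"f2", "f3"}),
    frozenset({"f1", "f2"}),
}

others = [f for f in features if f != target]

# All non-empty coalitions that do not contain the target feature.
coalitions = [
    frozenset(subset)
    for size in range(1, len(others) + 1)
    for subset in combinations(others, size)
]

swings = sum(1 for coalition in coalitions if coalition in winning)

# Assumed normalization: 2^(n-1) subsets of the other players, including
# the empty coalition, which contributes no swing here.
banzhaf_f4 = swings / 2 ** (len(features) - 1)

print(len(coalitions), swings, banzhaf_f4)  # 7 3 0.375
```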

To evaluate the impact of feature $f_i$, we need to calculate the proportion of
'winning' coalitions. Done exhaustively, this would lead to a high computational
complexity; however, our model randomly selects only a small group of features on which
to compute the Banzhaf power index at each node. Hence, the computational complexity is
fairly low.

To calculate the proportion of 'winning' coalitions, we use the conditional mutual
information of information theory to evaluate the interdependence between a single
feature $f_j \notin S \subseteq \mathcal{N}$ and each feature player
$f_i \in S \subseteq \mathcal{N}$. If more than half of the feature players $f_i \in S$
are interdependent with $f_j$, then we have
$\Delta_j(S) = \gamma(S \cup j) - \gamma(S) = 1$.

In this paper, the conditional mutual information is defined as the amount of
interdependence between feature player $f_j \notin S$ and feature player $f_i \in S$
given the rest of the feature player coalition $S$. It is formally defined by

$$I(f_j; f_i \mid S \setminus f_i) = \sum_{x \in f_j} \sum_{y \in f_i} \sum_{z \in S \setminus f_i} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}. \qquad (5)$$

By Eqs. (1), (3) and (5), we can obtain the Banzhaf power index of each feature player
for the construction of each decision tree.
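
A plug-in estimate of the conditional mutual information in Eq. (5) can be sketched in
Python as follows, for discrete-valued features. This is an illustration under
assumptions: probabilities are estimated by empirical frequencies, the conditioning
features are passed as one hashable value (for example a tuple) per sample, and the
cutoff `min_cmi` used to declare two features 'interdependent' is a hypothetical
parameter that the paper does not specify.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def conditional_mutual_information(
    x: Sequence[Hashable], y: Sequence[Hashable], z: Sequence[Hashable]
) -> float:
    """Empirical I(X; Y | Z) in nats for discrete samples of equal length.

    Uses the identity
        p(x, y | z) / (p(x | z) p(y | z)) = p(x, y, z) p(z) / (p(x, z) p(y, z)).
    """
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    total = 0.0
    for (xv, yv, zv), count in c_xyz.items():
        ratio = (count * c_z[zv]) / (c_xz[(xv, zv)] * c_yz[(yv, zv)])
        total += (count / n) * math.log(ratio)
    return total


def is_interdependent(x, y, z, min_cmi: float = 0.05) -> bool:
    """Hypothetical rule: features count as interdependent above a cutoff."""
    return conditional_mutual_information(x, y, z) > min_cmi


# X and Y identical, Z constant: I(X; Y | Z) equals the entropy of X, log 2.
x = [0, 1, 0, 1]
z = [0, 0, 0, 0]
same = conditional_mutual_information(x, x, z)

# X and Y independent given Z: every (x, y) pair is equally frequent.
indep = conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], z)

print(round(same, 4), round(indep, 4))  # 0.6931 0.0
```

Counting how many members of a coalition pass such a test, and comparing that share
with the threshold $\tau$, gives the 0/1 marginal contribution used in the index.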
4.3 Banzhaf random forests algorithm

Given a training data set $D_n = \{(X_i, Y_i)\}_{i=1}^{n}$, it includes $n$ samples and
the dimensionality of the data is $M$. The procedure of the Banzhaf random forests (BRF)
algorithm can be described as follows.

- For the construction of each Banzhaf decision tree in BRF, randomly draw $n$ samples
with replacement using the bootstrap and randomly select $h \le M$ features without
replacement from the training data. Based on this data set
$D_n = (X_i, Y_i)_{n \times (h+1)}$, grow a recursive Banzhaf tree.

- For the root node, the feature is selected with the information gain rate. For all
the other nodes, the features are selected with the Banzhaf power index. The feature
associated with the corresponding node is split at the midpoint of the feature values,
to generate the left and right branches.

- If a (terminal) node has a percentage of incorrectly assigned samples less than d,
then stop building the Banzhaf tree, where d is a pre-specified number.

- BRF predicts the labels of test data based on the votes it receives from each Banzhaf
tree.
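
The sampling and stopping steps of the procedure above can be sketched in Python as
follows. This is a simplified illustration, not the authors' implementation: the helper
names are hypothetical, the node error is taken to be the fraction of samples that do
not belong to the node's majority class, and feature selection by information gain rate
or Banzhaf power index is left out.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def draw_tree_data(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    h: int,
    rng: random.Random,
) -> Tuple[List[List[float]], List[int], List[int]]:
    """Bootstrap n rows with replacement and pick h features without replacement."""
    n, m = len(X), len(X[0])
    row_ids = [rng.randrange(n) for _ in range(n)]  # with replacement
    feature_ids = sorted(rng.sample(range(m), h))   # without replacement
    X_boot = [[X[i][j] for j in feature_ids] for i in row_ids]
    y_boot = [y[i] for i in row_ids]
    return X_boot, y_boot, feature_ids


def should_stop(labels: Sequence[int], d: float) -> bool:
    """Stop when the share of samples outside the majority class is below d.

    Assumptions: "incorrectly assigned" is read as "not in the majority
    class", and d is expressed as a fraction in [0, 1].
    """
    majority_count = Counter(labels).most_common(1)[0][1]
    error_rate = 1.0 - majority_count / len(labels)
    return error_rate < d


rng = random.Random(0)
X = [[0.1, 1.0, 5.0], [0.4, 2.0, 6.0], [0.9, 3.0, 7.0], [0.7, 4.0, 8.0]]
y = [0, 0, 1, 1]
X_boot, y_boot, feature_ids = draw_tree_data(X, y, h=2, rng=rng)
print(len(X_boot), len(X_boot[0]), feature_ids)
print(should_stop([1, 1, 1, 0], d=0.3), should_stop([1, 1, 0, 0], d=0.3))
```

Each tree gets its own call to `draw_tree_data`, so the trees differ both in the rows
and in the feature subset they see.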

Our algorithm is similar to the original algorithm of [2]. Both use bootstrap
aggregating, i.e., the bagging ensemble algorithm. The main difference between BRF and
the algorithm of [2] is in how the feature associated with a node is selected: BRF uses
the Banzhaf power index, while Breiman's method uses the Gini index. Another difference
is that BRF splits each node at the midpoint of the feature values, whereas Breiman's
algorithm does not. More importantly, as shown in the next section, the consistency of
BRF is theoretically guaranteed, but that of Breiman's algorithm is not.

We have also tested a pure Banzhaf random forests model, i.e. one in which the feature
of the root node is also selected via the Banzhaf power index. Its performance is
generally worse than that of the BRF algorithm described above. One reason for this
result may be that the feature selected via the information gain rate at the root node
captures some important invariant information of the data.


4.4 Prediction 


We denote a recursive tree created in the BRF algorithm based on the data
$D_n = \{(X_i, Y_i)\}_{i=1}^{n}$ as $g_n$, where the $(X_i, Y_i)$ are i.i.d. pairs of
random variables such that $X$ (the feature vector) takes its values in $\mathbb{R}^d$
while $Y$ (the label) is a multiclass random variable. To make a prediction for a query
point $x$, each Banzhaf decision tree computes

$$\zeta_n^k(x) = \frac{1}{N(A_n(x))} \sum_{X_i \in A_n(x)} \mathbb{I}(Y_i = k),$$

where $A_n(x)$ denotes the node of the tree containing $x$, and $N(A_n(x))$ is the
number of training points located in $A_n(x)$. The tree prediction is then the class
which maximizes this quantity:

$$g_n(x) = \arg\max_k \{\zeta_n^k(x)\}.$$

The forest predicts the class with the most votes from the individual trees.
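
The two prediction steps can be sketched in Python as follows: a leaf returns the class
with the largest label fraction among its training points, and the forest takes a
majority vote over the trees. The function names and the toy leaf contents are
hypothetical, and ties are broken by first occurrence, which the paper does not specify.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def leaf_class_fractions(leaf_labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of training points of each class in the leaf containing x."""
    counts = Counter(leaf_labels)
    total = len(leaf_labels)
    return {label: count / total for label, count in counts.items()}


def tree_predict(leaf_labels: Sequence[Hashable]) -> Hashable:
    """Class that maximizes the leaf fraction (arg max over k)."""
    fractions = leaf_class_fractions(leaf_labels)
    return max(fractions, key=fractions.get)


def forest_predict(tree_votes: Sequence[Hashable]) -> Hashable:
    """Majority vote over the predictions of the individual trees."""
    return Counter(tree_votes).most_common(1)[0][0]


# Labels of the training points in the leaf that x falls into, for 3 trees.
leaves = [["a", "a", "b"], ["b", "b", "a", "b"], ["a", "c", "a"]]
votes = [tree_predict(labels) for labels in leaves]
print(votes, forest_predict(votes))  # ['a', 'b', 'a'] a
```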
5 Consistency

In this section, we prove the consistency of Banzhaf random forests. We denote the
Banzhaf tree created by Banzhaf random forests trained on the data
$D_n = \{(X_i, Y_i)\}_{i=1}^{n}$ as $g_n$. The consistency of a sequence $\{g_n\}$ is
defined as follows.

Definition 1 A sequence of classifiers $\{g_n\}$ is consistent for a given distribution
of $(X, Y)$ if the probability of prediction error of $g_n$ converges in probability to
the Bayes risk, i.e.

$$L(g_n) = P(g_n(X, \theta) \ne Y \mid D_n) \to L^*,$$

as $n \to \infty$. Here, $\theta$ denotes the randomness in the tree-building algorithm,
$D_n$ is the training data set and the probability in the convergence is over the
random selection of $D_n$. The Bayes risk is the probability of prediction error of the
Bayes classifier, which makes predictions by choosing the class with the highest
posterior probability,

$$g(x) = \arg\max_k P(Y = k \mid X = x).$$

To reduce the complexity of the problem, we note that a multi-class classifier can be
transformed into a combination of several binary classifiers. It therefore suffices to
prove the consistency of the estimators of the posterior distribution of each class. A
similar result was shown by Denil et al. [27].

Lemma 1 Suppose we have estimates $\zeta_n^k(x)$ for each class posterior
$\zeta^k(x) = P(Y = k \mid X = x)$ and that these estimates are each consistent. Then
the classifier

$$g_n(x) = \arg\max_k \{\zeta_n^k(x)\}$$

is consistent for the corresponding multi-class classification problem.

Proof. By definition, the rule

$$g(x) = \arg\max_k \{\zeta^k(x)\}$$

achieves the Bayes risk. In the case where all the $\zeta^k(x)$ are equal there is
nothing to prove, since all choices have the same probability of error. So, suppose
there is at least one $k$ such that $\zeta^k(x) < \zeta^{g(x)}(x)$ and define

$$m(x) = \zeta^{g(x)}(x) - \max_k \{\zeta^k(x) \mid \zeta^k(x) < \zeta^{g(x)}(x)\},$$

$$m_n(x) = \zeta_n^{g(x)}(x) - \max_k \{\zeta_n^k(x) \mid \zeta^k(x) < \zeta^{g(x)}(x)\}.$$

The function $m(x) > 0$ is the margin function, which measures how much better the best
choice is than the second best choice. The function $m_n(x)$ measures the margin of
$g_n(x)$. If $m_n(x) > 0$ then $g_n(x)$ has the same probability of error as the Bayes
classifier.

The assumption above guarantees that there is some $\epsilon$ such that
$m(x) > \epsilon$. Using $C$ to denote the number of classes, by making $n$ large we
can satisfy

$$P\big(|\zeta_n^k(X) - \zeta^k(X)| < \epsilon/2\big) \ge 1 - \delta/C$$

since $\zeta_n^k$ is consistent. Thus

$$P\Big(\bigcap_{k=1}^{C} \big\{|\zeta_n^k(X) - \zeta^k(X)| < \epsilon/2\big\}\Big) \ge 1 - C + \sum_{k=1}^{C} P\big(|\zeta_n^k(X) - \zeta^k(X)| < \epsilon/2\big) \ge 1 - \delta.$$

So with probability at least $1 - \delta$ we have

$$\begin{aligned}
m_n(X) &= \zeta_n^{g(X)}(X) - \max_k \{\zeta_n^k(X) \mid \zeta^k(X) < \zeta^{g(X)}(X)\} \\
&\ge \big(\zeta^{g(X)}(X) - \epsilon/2\big) - \max_k \{\zeta^k(X) + \epsilon/2 \mid \zeta^k(X) < \zeta^{g(X)}(X)\} \\
&= \zeta^{g(X)}(X) - \max_k \{\zeta^k(X) \mid \zeta^k(X) < \zeta^{g(X)}(X)\} - \epsilon \\
&= m(X) - \epsilon > 0.
\end{aligned}$$

Since $\delta$ is arbitrary, this means that the risk of $g_n$ converges in probability
to the Bayes risk.

Lemma 1 allows the proof of consistency of the multiclass classifier to be reduced to
proving the consistency of several two-class posterior estimates. That is, given a set
of classes $\{1, \ldots, c\}$ we can re-assign the labels using the map
$(X, Y) \mapsto (X, \mathbb{I}(Y = k))$ for any $k \in \{1, \ldots, c\}$ in order to get
a two-class problem where $P(Y = 1 \mid X = x)$ in this new problem is equal to
$\zeta^k(x)$ in the original multiclass problem.

Inspired by [11], the following Lemma 2 allows us to focus our attention on the
consistency of each of the tree estimators in the classification forests.

Lemma 2 Assume that the sequence $\{g_n\}$ of randomized classifiers is consistent for
a certain distribution of $(X, Y)$. Then the voting classifier $g_n^{(M)}$ obtained by
taking the majority vote over $M$ (not necessarily independent) copies of $g_n$ is also
consistent.

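Proof. Let $g(x)$ denote the Bayes classifier. Consistency of $\{g_n\}$ is equivalent to
saying that $E[L(g_n)] = P(g_n(X, \theta) \ne Y) \to L^*$. In fact, since
$P(g_n(X, \theta) \ne Y \mid X = x) \ge P(g(X) \ne Y \mid X = x)$ for all
$x \in \mathbb{R}^d$, consistency of $\{g_n\}$ means that for $\mu$-almost all $x$,

$$P(g_n(X, \theta) \ne Y \mid X = x) \to P(g(X) \ne Y \mid X = x) = 1 - \max_k \{\zeta^k(x)\}.$$

Define the following sets of indices:

$$G = \{k \mid \zeta^k(x) = \max_k \{\zeta^k(x)\}\}, \qquad B = \{k \mid \zeta^k(x) < \max_k \{\zeta^k(x)\}\}.$$

Then

$$\begin{aligned}
P(g_n(X, \theta) \ne Y \mid X = x) &= \sum_k P(g_n(X, \theta) = k \mid X = x)\, P(Y \ne k \mid X = x) \\
&\le \big(1 - \max_k \{\zeta^k(x)\}\big) \sum_{k \in G} P(g_n(X, \theta) = k \mid X = x) + \sum_{k \in B} P(g_n(X, \theta) = k \mid X = x),
\end{aligned}$$

which means it suffices to show that $P(g_n^{(M)}(X, \theta^M) = k \mid X = x) \to 0$ for
all $k \in B$. However, using $\theta^M$ to denote $M$ (possibly dependent) copies of
$\theta$, for all $k \in B$ we have

$$\begin{aligned}
P(g_n^{(M)}(x, \theta^M) = k) &= P\Big(\sum_{j=1}^{M} \mathbb{I}\{g_n(x, \theta_j) = k\} > \max_{c \ne k} \sum_{j=1}^{M} \mathbb{I}\{g_n(x, \theta_j) = c\}\Big) \\
&\le P\Big(\sum_{j=1}^{M} \mathbb{I}\{g_n(x, \theta_j) = k\} \ge 1\Big).
\end{aligned}$$

By Markov's inequality,

$$P\Big(\sum_{j=1}^{M} \mathbb{I}\{g_n(x, \theta_j) = k\} \ge 1\Big) \le E\Big[\sum_{j=1}^{M} \mathbb{I}\{g_n(x, \theta_j) = k\}\Big] = M\, P(g_n(x, \theta) = k) \to 0.$$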


According to Lemma 2, the consistency of Banzhaf random forests is implied by the
consistency of the trees of which they are composed. In addition, we use the bagging
ensemble method to construct BRF. By Theorem 1 in [11], the consistency of a voting
Banzhaf random forest follows from the consistency of the base classifier. Here, Biau
et al. introduce a parameter $q_n \in [0, 1]$: in the bootstrap sample $D_n(\theta)$,
each data pair $(X_i, Y_i)$ is present with probability $q_n$, independently of the
others.

Theorem 1 Let $\{g_n\}$ be a sequence of classifiers that is consistent for the
distribution of $(X, Y)$. Consider the Banzhaf random forests (majority voting
classifiers) $g_n^{(m)}(X, \theta^m, D_n)$, using parameter $q_n$. If
$n q_n \to \infty$ as $n \to \infty$ then both classifiers are consistent.

Proof. See the proof of Theorem 1 in [11].

With Lemma 2 and Theorem 1 established, the remaining effort goes into proving the
consistency of the Banzhaf tree construction, since each tree in the Banzhaf forest is
built on the Banzhaf power index. We show that if a classifier is conditionally
consistent given a small group of random variables, and the sampling process that uses
the Banzhaf power index to generate these random variables produces acceptable
sequences with probability 1, then the resulting classifier is unconditionally
consistent.

Theorem 2 Suppose $\{g_n\}$ is a sequence of classifiers whose probability of error
converges conditionally in probability to the Bayes risk $L^*$ for a specified
distribution on $(X, Y)$, i.e.

$$P(g_n(X, \theta, I) \ne Y \mid I) \to L^*$$

for all $I \in \mathcal{I}$, where $I$ is a random sequence produced by the Banzhaf
power index, and that $\nu$ is a distribution on $I$. If $\nu(\mathcal{I}) = 1$, which
means that an acceptable sequence is produced with probability 1, then the probability
of error converges unconditionally in probability, i.e.

$$P(g_n(X, \theta, I) \ne Y) \to L^*,$$

and $\{g_n\}$ is consistent for the specified distribution.

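Proof. The sequence in question is uniformly integrable, so it is sufficient to show
that $E[P(g_n(X, \theta, I) \ne Y \mid I)] \to L^*$ implies the result, where the
expectation is taken over the random selection of the training set and $I$ is the
specific structure of the tree $g_n$. We can write

$$\begin{aligned}
P(g_n(X, \theta, I) \ne Y) &= E[P(g_n(X, \theta, I) \ne Y \mid I)] \\
&= \int_{\mathcal{I}} P(g_n(X, \theta, I) \ne Y \mid I)\, d\nu(I) + \int_{\mathcal{I}^c} P(g_n(X, \theta, I) \ne Y \mid I)\, d\nu(I).
\end{aligned}$$

By assumption $\nu(\mathcal{I}^c) = 0$, so we have

$$\lim_{n \to \infty} P(g_n(X, \theta, I) \ne Y) = \lim_{n \to \infty} \int_{\mathcal{I}} P(g_n(X, \theta, I) \ne Y \mid I)\, d\nu(I).$$

Since probabilities are bounded in the interval [0,1], the dominated convergence
theorem allows us to exchange the integral and the limit,

$$\lim_{n \to \infty} P(g_n(X, \theta, I) \ne Y) = \int_{\mathcal{I}} \lim_{n \to \infty} P(g_n(X, \theta, I) \ne Y \mid I)\, d\nu(I),$$

and by assumption the conditional risk converges to the Bayes risk for all
$I \in \mathcal{I}$, so

$$\lim_{n \to \infty} P(g_n(X, \theta, I) \ne Y) = L^* \int_{\mathcal{I}} d\nu(I) = L^*,$$

which is the desired result.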

In fact, let the Banzhaf power index $\eta(f_i)$ be equal to the income distribution
function $\gamma(f_i)$ in a tree construction game $\Gamma = (\mathcal{N}, \gamma)$,
i.e., $\eta(f_i) = \gamma(f_i)$. Because we choose the feature with the maximum Banzhaf
power index at each node of each tree, we obtain an acceptable sequence of random
variables, all with the maximum Banzhaf power index. By $\eta(f_i) = \gamma(f_i)$, the
features in this sequence cooperate to obtain the best result. So it is sufficient to
show that the Banzhaf tree is consistent conditioned on such a sequence.

In conclusion, the consistency of our tree construction follows from Theorem 2.
Together with Theorem 1, this yields the consistency of Banzhaf random forests.

6 Experiments 

To evaluate the proposed algorithm, BRF, we tested it on several data sets from the
UCI machine learning repository, including iris, wine, ecoli, thyroid, soybean,
shuttle, dermatology, sonar and musk2. We compare it with Breiman's random forests [2]
and the model proposed in [12]. We implemented Breiman's random forest with C4.5 as it
generally performs well on classification problems. As mentioned above, the model
proposed in [12] is consistent. For comparison, we also list the classification results
obtained with k-nearest neighbor classifiers (KNNs) and support vector machines (SVMs).
Table 1 summarizes the UCI data sets used.

Datasets       No. examples   No. features   No. classes
soybean                  47             35             4
iris                    150              4             3
wine                    178             13             3
sonar                   208             20             2
thyroid                 215              5             3
ecoli                   357              7             8
dermatology             366             34             6
musk2                  6598            166             2
shuttle               14516              9             7

Table 1. Summary of the used UCI data sets.


6.1 Effect of the number of trees in BRF

To evaluate the effect of the number of trees in BRF, we conducted experiments on three
data sets: iris, ecoli and shuttle. Fig. 2 shows the classification accuracy obtained
against the number of trees in BRF. We can see that BRF is largely robust to the number
of trees. In particular, with 100 trees, BRF performs slightly better than with other
values.



[Figure omitted; horizontal axis: number of trees]

Fig. 2. Effect of the number of trees in BRF.


6.2 Comparison on running efficiency 

To test the running speed of BRF, we performed experiments on seven data sets: iris, 
wine, ecoli, soybean, thyroid, dermatology and shuttle. We compared it with the model 
ofH and that of 112. From Table]^ we can see that, the running of BRF is slower than 
the model of 12. This is mainly because calculation of the Banzhaf power index needs 
some time when constructing the trees. However, BRF is more efficient than the model 
of II2, which is a state-of-the-art consistent random forests model. 
















Lecture Notes in Computer Science: Authors’ Instructions 




Datasets       Breiman01      Biau12        BRF
iris               1.321       3.107      1.654
wine               5.401      16.781      9.134
ecoli              5.729      17.438      8.778
soybean            0.673       5.761      2.297
thyroid            2.857       4.856      3.168
dermatology        2.463      71.201     11.023
shuttle           49.71    39600.63      80.660

Table 2. Running time (in seconds) of the two compared models and BRF on seven UCI
data sets.
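
The running times in Table 2 can be turned into ratios to make the comparison concrete.
The short Python sketch below uses only the values reported in the table; the
dictionary layout and variable names are illustrative.

```python
# Running times in seconds, copied from Table 2: (Breiman01, Biau12, BRF).
running_time = {
    "iris": (1.321, 3.107, 1.654),
    "wine": (5.401, 16.781, 9.134),
    "ecoli": (5.729, 17.438, 8.778),
    "soybean": (0.673, 5.761, 2.297),
    "thyroid": (2.857, 4.856, 3.168),
    "dermatology": (2.463, 71.201, 11.023),
    "shuttle": (49.71, 39600.63, 80.660),
}

# How many times faster BRF is than Biau12 on each data set.
speedup_vs_biau = {
    name: biau / brf for name, (_, biau, brf) in running_time.items()
}

# How many times slower BRF is than Breiman01 on each data set.
slowdown_vs_breiman = {
    name: brf / breiman for name, (breiman, _, brf) in running_time.items()
}

for name in running_time:
    print(
        f"{name:12s} {speedup_vs_biau[name]:8.1f}x faster than Biau12, "
        f"{slowdown_vs_breiman[name]:5.1f}x slower than Breiman01"
    )
```

On these numbers BRF is between roughly 1.5 and 490 times faster than Biau12, and
between roughly 1.1 and 4.5 times slower than Breiman01.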


6.3 Classification results 

To evaluate BRF on multi-class classification problems, we compared it with KNNs,
SVMs, the model of [2], and the model of [12]. Nine UCI data sets were used: iris,
wine, ecoli, thyroid, soybean, shuttle, dermatology, sonar and musk2. For all these
data sets, we used 5-fold cross validation to test the models, and the average
classification accuracies are reported. For the model of [12] and BRF, we used the same
number of trees in the random forests. Following Breiman's suggestion for
classification problems [2], we set the number of trees to
$\mathrm{round}(\log_2(h) + 1)$, where $h$ is the dimensionality of the features. To be
fair, we set up the same termination condition for all the random forests models, i.e.
the percentage of incorrectly assigned samples at a terminal node should be no greater
than the number of classes of the data set. For KNNs and SVMs, we selected the
parameters with 5-fold cross validation.

Table 3 shows the results obtained by the compared models and BRF. We can see that BRF
performs slightly better than KNNs, SVMs and the model of [2] on most data sets (it
attains the best or tied-best accuracy on six of the nine, the exceptions being iris,
thyroid and ecoli), and consistently better than the model of [12]. This demonstrates
that using interdependent features to construct the randomized trees can lead to better
results than using independent features in random forests.


Datasets       KNN      SVM      Breiman01   Biau12   BRF
soybean        1.0000   1.0000   1.0000      0.5717   1.0000
iris           0.9467   0.9867   0.9467      0.8353   0.9467
wine           0.9423   0.6782   0.9599      0.5580   0.9717
sonar          0.5908   0.6583   0.7032      0.5819   0.7120
thyroid        0.9395   0.9023   0.9488      0.8000   0.9395
ecoli          0.8356   0.8431   0.5958      0.4286   0.6665
dermatology    0.9656   0.9540   0.9589      0.4397   0.9677
musk2          0.7227   0.8508   0.8509      0.6542   0.8710
shuttle        0.9951   0.9752   0.9957      0.8256   0.9957


Table 3. Classification accuracy obtained by the compared models and BRF on the UCI data sets. 
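
To make the comparison in Table 3 easier to check, the short Python sketch below transcribes the reported accuracies and computes, for each model, its mean accuracy and the number of data sets on which BRF is at least as accurate. The helper names (`mean_accuracy`, `at_least_as_good`) are illustrative assumptions; only the numbers come from the table.

```python
# Accuracies transcribed from Table 3; column order follows MODELS.
MODELS = ["KNN", "SVM", "Breiman01", "Biau12", "BRF"]
TABLE3 = {
    "soybean":     [1.0000, 1.0000, 1.0000, 0.5717, 1.0000],
    "iris":        [0.9467, 0.9867, 0.9467, 0.8353, 0.9467],
    "wine":        [0.9423, 0.6782, 0.9599, 0.5580, 0.9717],
    "sonar":       [0.5908, 0.6583, 0.7032, 0.5819, 0.7120],
    "thyroid":     [0.9395, 0.9023, 0.9488, 0.8000, 0.9395],
    "ecoli":       [0.8356, 0.8431, 0.5958, 0.4286, 0.6665],
    "dermatology": [0.9656, 0.9540, 0.9589, 0.4397, 0.9677],
    "musk2":       [0.7227, 0.8508, 0.8509, 0.6542, 0.8710],
    "shuttle":     [0.9951, 0.9752, 0.9957, 0.8256, 0.9957],
}


def mean_accuracy(model: str) -> float:
    """Unweighted mean accuracy of one model over all data sets."""
    col = MODELS.index(model)
    return sum(row[col] for row in TABLE3.values()) / len(TABLE3)


def at_least_as_good(model: str, baseline: str) -> int:
    """Number of data sets where `model` matches or exceeds `baseline`."""
    a, b = MODELS.index(model), MODELS.index(baseline)
    return sum(row[a] >= row[b] for row in TABLE3.values())


if __name__ == "__main__":
    for name in MODELS:
        print(f"{name:10s} mean={mean_accuracy(name):.4f} "
              f"BRF>=: {at_least_as_good('BRF', name)}/{len(TABLE3)}")
```

By this count BRF matches or exceeds Biau12 on all nine data sets, while KNN and SVM remain ahead on ecoli.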


7 Conclusion

In this paper, we propose a novel random forests model called Banzhaf random forests
(BRF), based on concepts from cooperative game theory. Its consistency is proved,
which takes a step towards narrowing the gap between the theory and practice of
random forests. This work is probably the first to apply cooperative game theory
to random forests, and we have tested and verified the feasibility of the idea.
Experiments on UCI data sets show that BRF not only slightly outperforms state-of-the-art
classifiers, including KNNs, SVMs and the random forests model of Breiman [2], but
is also much more efficient than existing consistent random forests.


Acknowledgment 

This research was supported by the National Natural Science Foundation of China
(NSFC) under Grant nos. 61271405 and 61403353, and the Fundamental Research Funds
for the Central Universities of China.


References 

1. Zhou, Zhi-Hua: Ensemble methods: foundations and algorithms. CRC Press (2012) 

2. Breiman, Leo.: Random forests. Machine learning, vol. 45, pp. 5-32. Springer (2001) 

3. Lepetit, Vincent and Fua, Pascal: Keypoint recognition using randomized trees. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, pp. 1465-1479. IEEE (2006)

4. Ozuysal, Mustafa and Fua, Pascal and Lepetit, Vincent: Fast keypoint recognition in ten lines
of code. Computer Vision and Pattern Recognition, 2007, CVPR’07. pp. 1-8. IEEE (2007)

5. Shotton, Jamie and Sharp, Toby and Kipman, Alex and Fitzgibbon, Andrew and Finocchio,
Mark and Blake, Andrew and Cook, Mat and Moore, Richard: Real-time human pose
recognition in parts from single depth images. Communications of the ACM, vol. 56, pp. 116-124.
ACM (2013)

6. Zikic, Darko and Glocker, Ben and Criminisi, Antonio: Atlas encoding by randomized 
forests for efficient label propagation. Medical Image Computing and Computer-Assisted 
Intervention-MICCAI 2013, pp. 66-73. Springer (2013) 

7. Winn, John and Criminisi, Antonio: Object class recognition at a glance. In Video Proc. 
CVPR (2006) 

8. Yin, Pei and Criminisi, Antonio and Winn, John and Essa, Irfan: Tree-based classifiers for
bilayer video segmentation. Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE
Conference on, pp. 1-8. IEEE (2007)

9. Bosch, Anna and Zisserman, Andrew and Muñoz, Xavier: Image classification using random
forests and ferns. Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference
on, pp. 1-8. IEEE (2007)

10. Shotton, Jamie and Johnson, Matthew and Cipolla, Roberto: Semantic texton forests for
image categorization and segmentation. Computer vision and pattern recognition, 2008. CVPR
2008. IEEE Conference on, pp. 1-8. IEEE (2008)

11. Biau, Gerard and Devroye, Luc and Lugosi, Gabor: Consistency of random forests and other
averaging classifiers. The Journal of Machine Learning Research, vol. 9, pp. 2015-2033.
JMLR.org (2008)


12. Biau, Gerard: Analysis of a random forests model. The Journal of Machine Learning
Research, vol. 13, pp. 1063-1095. JMLR.org (2012)

13. Breiman, Leo and Friedman, Jerome and Stone, Charles J and Olshen, Richard A.:
Classification and regression trees. CRC press (1984)

14. Breiman, Leo.: Bagging predictors. Machine learning, vol. 24, pp. 123-140. Springer (1996) 

15. Ho, Tin Kam: The random subspace method for constructing decision forests. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 20, pp. 832-844. IEEE (1998)

16. Amit, Yali and Geman, Donald: Shape quantization and recognition with randomized trees. 
Neural computation, vol. 9, pp. 1545-1588. MIT Press (1997) 

17. Dietterich, Thomas G.: An experimental comparison of three methods for constructing
ensembles of decision trees: Bagging, boosting, and randomization. Machine learning, vol. 40,
pp. 139-157. Springer (2000)

18. Kwok, Suk Wah and Carter, Chris: Multiple decision trees. arXiv preprint arXiv:1304.2363
(2013)

19. Criminisi, Antonio and Shotton, Jamie and Konukoglu, Ender: Decision forests: A unified
framework for classification, regression, density estimation, manifold learning and
semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision,
pp. 81-227 (2012)

20. Svetnik, Vladimir and Liaw, Andy and Tong, Christopher and Culberson, J Christopher and 
Sheridan, Robert P and Feuston, Bradley P: Random forest: a classification and regression 
tool for compound classification and QSAR modeling. Journal of chemical information and 
computer sciences, vol. 43, pp. 1947-1958. ACS Publications (2003) 

21. Prasad, Anantha M and Iverson, Louis R and Liaw, Andy: Newer classification and
regression tree techniques: bagging and random forests for ecological prediction. Ecosystems,
vol. 9, pp. 181-199. Springer (2006)

22. Cutler, D Richard and Edwards Jr, Thomas C and Beard, Karen H and Cutler, Adele and 
Hess, Kyle T and Gibson, Jacob and Lawler, Joshua J.: Random forests for classification in 
ecology. Ecology, vol. 88, pp. 2783-2792. Eco Soc America (2007) 

23. Criminisi, Antonio and Shotton, Jamie: Decision forests for computer vision and medical 
image analysis. Springer Science & Business Media (2013) 

24. Breiman, Leo.: Consistency for a simple model of random forests. Statistical Department,
University of California at Berkeley. Technical Report (2004)

25. Meinshausen, Nicolai: Quantile regression forests. The Journal of Machine Learning
Research, vol. 7, pp. 983-999. JMLR.org (2006)

26. Ishwaran, Hemant and Kogalur, Udaya B.: Consistency of random survival forests. Statistics 
& probability letters, vol. 80, pp. 1056-1064. Elsevier (2010) 

27. Denil, Misha and Matheson, David and de Freitas, Nando: Consistency of online random
forests. arXiv preprint arXiv:1302.4853 (2013)

28. Denil, Misha and Matheson, David and de Freitas, Nando: Narrowing the gap: Random
forests in theory and in practice. arXiv preprint arXiv:1310.1415 (2013)

29. Lin, Yi and Jeon, Yongho: Random forests and adaptive nearest neighbors. Journal of the 
American Statistical Association, vol. 101, pp. 578-590. Taylor & Francis (2006) 

30. Györfi, L and Devroye, L and Lugosi, G.: A probabilistic theory of pattern recognition.
Springer-Verlag (1996)

31. Schapire, Robert E and Freund, Yoav: Boosting: Foundations and Algorithms. Kybernetes,
vol. 42, pp. 164-166. Emerald Group Publishing Limited (2013)

32. Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome: The elements of statistical
learning, vol. 2. Springer (2009)

33. Banzhaf III, John F: Weighted voting doesn’t work: A mathematical analysis. Rutgers L.
Rev., vol. 19, pp. 317. HeinOnline (1964)


34. Chalkiadakis, Georgios and Elkind, Edith and Wooldridge, Michael: Computational aspects
of cooperative game theory. Synthesis Lectures on Artificial Intelligence and Machine
Learning, vol. 5, pp. 1-168. Morgan & Claypool Publishers (2011)



